Dog Breed Prediction using Transfer Learning on SageMaker

This notebook walks through the implementation of an image-classification machine learning model that distinguishes between 133 dog breeds, using the dog breed dataset provided by Udacity (https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip)

  • We will be using a pretrained ResNet50 model from the PyTorch vision library (https://pytorch.org/vision/master/generated/torchvision.models.resnet50.html)
  • We will be adding two fully connected neural network layers on top of this ResNet50 model.
  • Note: Since we are applying transfer learning, we will freeze all the existing convolutional layers in the pretrained ResNet50 model and compute gradients only for the two fully connected layers we have added.
  • Then we will perform hyperparameter tuning to find the best hyperparameters to use for our model.
  • Next we will use the best hyperparameters to fine-tune our ResNet50 model.
  • We will also add configuration for profiling and debugging our training job by adding the relevant hooks in the training and testing (evaluation) phases.
  • Next we will deploy our model. While deploying, we will create a custom inference script that overrides a few functions used by the deployed endpoint for making inferences/predictions.
  • Finally, we will test our model with some test images of dogs to verify that it works as expected.
In [2]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug
Collecting smdebug
  Using cached smdebug-1.0.12-py2.py3-none-any.whl (270 kB)
Collecting pyinstrument==3.4.2
  Using cached pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (3.20.1)
Requirement already satisfied: boto3>=1.10.32 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.22.2)
Requirement already satisfied: numpy>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.21.6)
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from smdebug) (20.1)
Collecting pyinstrument-cext>=0.2.2
  Using cached pyinstrument_cext-0.2.4-cp37-cp37m-manylinux2010_x86_64.whl (20 kB)
Requirement already satisfied: botocore<1.26.0,>=1.25.2 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.25.2)
Requirement already satisfied: s3transfer<0.6.0,>=0.5.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (0.5.2)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.0.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (2.4.6)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (1.14.0)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.26.0,>=1.25.2->boto3>=1.10.32->smdebug) (1.26.9)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.26.0,>=1.25.2->boto3>=1.10.32->smdebug) (2.8.1)
Installing collected packages: pyinstrument-cext, pyinstrument, smdebug
Successfully installed pyinstrument-3.4.2 pyinstrument-cext-0.2.4 smdebug-1.0.12
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 22.0.4; however, version 22.1.2 is available.
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.
In [3]:
# TODO: Import any packages that you might need
# For instance you will need Boto3 and Sagemaker
import sagemaker
import boto3
from sagemaker.session import Session
from sagemaker import get_execution_role
# Initializing some useful variables
role = get_execution_role()
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
print(f"Region {region}")
print(f"Default s3 bucket : {bucket}")
Region us-east-1
Default s3 bucket : sagemaker-us-east-1-881607171913

Dataset

The dataset we use for this project is the dogImages dataset, which can be downloaded from https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip. It contains images of 133 dog breeds split into train, valid and test folders, each containing a sample of every breed. An example path is ./dogImages/test/018.Beauceron/Beauceron_01284.jpg
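Each class directory name encodes both a numeric index and the breed name, which torchvision's ImageFolder uses to assign labels automatically. A small stdlib sketch of pulling those pieces out of a sample path:

```python
from pathlib import Path

def parse_breed(path):
    """Split a dogImages path into (split, class index, breed name)."""
    parts = Path(path).parts
    split = parts[-3]                 # train / valid / test
    class_dir = parts[-2]             # e.g. "018.Beauceron"
    idx, breed = class_dir.split(".", 1)
    return split, int(idx), breed

print(parse_breed("./dogImages/test/018.Beauceron/Beauceron_01284.jpg"))
# → ('test', 18, 'Beauceron')
```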

In [9]:
#TODO: Fetch and upload the data to AWS S3

!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip  > /dev/null
--2022-06-07 16:50:03--  https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.193.8
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.193.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1132023110 (1.1G) [application/zip]
Saving to: ‘dogImages.zip’

dogImages.zip       100%[===================>]   1.05G  40.3MB/s    in 37s     

2022-06-07 16:50:43 (29.0 MB/s) - ‘dogImages.zip’ saved [1132023110/1132023110]

In [10]:
prefix ="dogImagesDataset"
print("Starting to upload dogImages")

inputs = sagemaker_session.upload_data(path="dogImages", bucket=bucket, key_prefix=prefix)
print(f"Input path ( S3 file path ): {inputs}")
Starting to upload dogImages
Input path ( S3 file path ): s3://sagemaker-us-east-1-881607171913/dogImagesDataset
In [11]:
inputs = 's3://sagemaker-us-east-1-881607171913/dogImagesDataset'
print(f"Input path ( S3 file path ): {inputs}")
Input path ( S3 file path ): s3://sagemaker-us-east-1-881607171913/dogImagesDataset

Hyperparameter Tuning

We use a ResNet50 model with two fully connected linear layers on top for this image-classification problem. ResNet-50 is 50 layers deep and was trained on more than a million images across 1000 categories from the ImageNet database; this deep architecture makes it a strong feature extractor for image recognition. The optimizer we will use is AdamW (for more info see: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html). The hyperparameters selected for tuning are:
  • Learning rate - the default (x) is 0.001, so we selected a range of 0.01x to 100x (0.0001 to 0.1).
  • eps - the default is 1e-08, which is acceptable in most cases, so we selected a range of 1e-09 to 1e-08.
  • Weight decay - the default (x) is 0.01, so we selected a range of 0.1x to 10x (0.001 to 0.1).
  • Batch size - restricted to two values: [64, 128].

In [12]:
# Importing the required classes from sagemaker.tuner
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner
)

# We will be using AdamW as the optimizer, which handles the weight-decay computation in a
# more correct (decoupled) way, so we tune weight_decay and eps as well,
# along with the learning rate and batch size
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.0001, 0.1),
    "eps": ContinuousParameter(1e-9, 1e-8),
    "weight_decay": ContinuousParameter(1e-3, 1e-1),
    "batch_size": CategoricalParameter([ 64, 128]),
}
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]
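The tuner does not receive the objective value directly from the training script; it scrapes it out of the job's CloudWatch logs using the regex above. A quick sanity check, using a test-loss line in the format our script prints, that the pattern captures the loss:

```python
import re

metric_regex = r"Test set: Average loss: ([0-9\.]+)"
# A log line in the format emitted by our training/HPO script
log_line = "Test set: Average loss: 4.1360, Accuracy: 267/836 (32%)"

match = re.search(metric_regex, log_line)
print(float(match.group(1)))  # → 4.136
```

If the script's print format and this regex ever drift apart, the tuning job fails to find the metric, so it is worth testing the pattern before launching an expensive tuning run.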
In [15]:
from  sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point = "hpo.py",
    base_job_name = "dog-breed-classification-hpo",
    role = role,
    instance_count = 1,
    instance_type = "ml.g4dn.xlarge",
    py_version = "py36",
    framework_version = "1.8"
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=1,
    objective_type=objective_type, 
    early_stopping_type="Auto"
)
In [16]:
# TODO: Fit your HP Tuner
tuner.fit({"training": inputs }, wait=True)
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................!
In [17]:
# Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()
2022-06-07 17:49:03 Starting - Preparing the instances for training
2022-06-07 17:49:03 Downloading - Downloading input data
2022-06-07 17:49:03 Training - Training image download completed. Training in progress.
2022-06-07 17:49:03 Uploading - Uploading generated training model
2022-06-07 17:49:03 Completed - Training job completed
Out[17]:
{'_tuning_objective_metric': '"average test loss"',
 'batch_size': '"64"',
 'eps': '6.907014853153287e-09',
 'lr': '0.00019923165693702715',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"dog-breed-classification-hpo-2022-06-07-17-08-40-632"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-east-1"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-881607171913/dog-breed-classification-hpo-2022-06-07-17-08-40-632/source/sourcedir.tar.gz"',
 'weight_decay': '0.01617333419372169'}
In [18]:
best_hyperparameters={'batch_size': int(best_estimator.hyperparameters()['batch_size'].replace('"', "")),
                      'eps': best_estimator.hyperparameters()['eps'],
                      'lr': best_estimator.hyperparameters()['lr'],
                      'weight_decay': best_estimator.hyperparameters()['weight_decay'],}
print(f"Best hyperparameters after hyperparameter tuning: \n {best_hyperparameters}")
Best hyperparameters after hyperparameter tuning: 
 {'batch_size': 64, 'eps': '6.907014853153287e-09', 'lr': '0.00019923165693702715', 'weight_decay': '0.01617333419372169'}
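Note that SageMaker stores hyperparameters as JSON-encoded strings, which is why batch_size needed the replace('"', "") above and why eps, lr and weight_decay come back as strings (the training script's argparse converts them back to floats on the way in). A stdlib sketch of normalizing the whole dict at once; clean_hyperparameters is our own helper, not a SageMaker API:

```python
import json

def clean_hyperparameters(raw):
    """Strip SageMaker's JSON string-encoding from tuned hyperparameters."""
    cleaned = {}
    for name, value in raw.items():
        if name.startswith("sagemaker_") or name.startswith("_"):
            continue  # drop framework bookkeeping entries
        if value.startswith('"'):
            value = json.loads(value)  # e.g. '"64"' -> '64'
        # the training script's argparse expects numeric types
        cleaned[name] = int(value) if name == "batch_size" else float(value)
    return cleaned

# Shape of the dict returned by best_estimator.hyperparameters()
raw = {"batch_size": '"64"', "lr": "0.00019923165693702715",
       "_tuning_objective_metric": '"average test loss"'}
print(clean_hyperparameters(raw))
```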

Model Profiling and Debugging

In [19]:
# Setting up debugger and profiler rules and configs
from sagemaker.debugger import (
    Rule,
    rule_configs, 
    ProfilerRule,
    DebuggerHookConfig,
    CollectionConfig,
    ProfilerConfig,
    FrameworkProfile
)


rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)

collection_configs=[CollectionConfig(name="CrossEntropyLoss_output_0",parameters={
    "include_regex": "CrossEntropyLoss_output_0", "train.save_interval": "10","eval.save_interval": "1"})]

debugger_config=DebuggerHookConfig( collection_configs=collection_configs )
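With the intervals configured above, SMDebug saves the CrossEntropyLoss output tensor every 10th step in training mode and every step in evaluation mode. A toy illustration of that step-selection logic (not the actual smdebug internals):

```python
def saved_steps(total_steps, save_interval):
    # A tensor is saved whenever the step index is a multiple of the interval
    return [s for s in range(total_steps) if s % save_interval == 0]

print(saved_steps(25, 10))  # train (interval 10): → [0, 10, 20]
print(saved_steps(5, 1))    # eval  (interval 1):  → [0, 1, 2, 3, 4]
```

The denser eval interval gives a smooth validation-loss curve for the debugging plots later, while the sparser train interval keeps the volume of saved tensors manageable.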
In [20]:
# Create and fit an estimator
estimator = PyTorch(
    entry_point="train_model.py",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    role=role,
    framework_version="1.6", #using 1.6 as it has support for smdebug lib , https://github.com/awslabs/sagemaker-debugger#debugger-supported-frameworks
    py_version="py36",
    hyperparameters=best_hyperparameters,
    profiler_config=profiler_config, # include the profiler hook
    debugger_hook_config=debugger_config, # include the debugger hook
    rules=rules
)

estimator.fit({'train' : inputs },wait=True)
2022-06-07 18:02:55 Starting - Starting the training job...VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
...
2022-06-07 18:03:51 Starting - Preparing the instances for training.........
2022-06-07 18:05:25 Downloading - Downloading input data.........
2022-06-07 18:06:55 Training - Downloading the training image.....bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-06-07 18:07:33,386 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-06-07 18:07:33,408 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-06-07 18:07:33,415 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-06-07 18:07:33,907 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "batch_size": 64,
        "eps": "6.907014853153287e-09",
        "lr": "0.00019923165693702715",
        "weight_decay": "0.01617333419372169"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "pytorch-training-2022-06-07-18-02-54-713",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/source/sourcedir.tar.gz",
    "module_name": "train_model",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.g4dn.xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.g4dn.xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train_model.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":64,"eps":"6.907014853153287e-09","lr":"0.00019923165693702715","weight_decay":"0.01617333419372169"}
SM_USER_ENTRY_POINT=train_model.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train_model
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":64,"eps":"6.907014853153287e-09","lr":"0.00019923165693702715","weight_decay":"0.01617333419372169"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"pytorch-training-2022-06-07-18-02-54-713","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/source/sourcedir.tar.gz","module_name":"train_model","network_interface_name":"eth0","num_cpus":4,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.g4dn.xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.g4dn.xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train_model.py"}
SM_USER_ARGS=["--batch_size","64","--eps","6.907014853153287e-09","--lr","0.00019923165693702715","--weight_decay","0.01617333419372169"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_BATCH_SIZE=64
SM_HP_EPS=6.907014853153287e-09
SM_HP_LR=0.00019923165693702715
SM_HP_WEIGHT_DECAY=0.01617333419372169
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 train_model.py --batch_size 64 --eps 6.907014853153287e-09 --lr 0.00019923165693702715 --weight_decay 0.01617333419372169
[2022-06-07 18:07:35.778 algo-1:27 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-06-07 18:07:36.300 algo-1:27 INFO profiler_config_parser.py:102] Using config at /opt/ml/input/config/profilerconfig.json.
Running on Device cuda:0
Hyperparameters : LR: 0.00019923165693702715,  Eps: 6.907014853153287e-09, Weight-decay: 0.01617333419372169, Batch Size: 64, Epoch: 2
Data Dir Path: /opt/ml/input/data/train
Model Dir  Path: /opt/ml/model
Output Dir  Path: /opt/ml/output/data

2022-06-07 18:07:52 Training - Training image download completed. Training in progress.[2022-06-07 18:07:41.996 algo-1:27 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2022-06-07 18:07:41.998 algo-1:27 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2022-06-07 18:07:41.999 algo-1:27 INFO hook.py:253] Saving to /opt/ml/output/tensors
[2022-06-07 18:07:42.000 algo-1:27 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2022-06-07 18:07:42.031 algo-1:27 INFO hook.py:584] name:fc.0.weight count_params:524288
[2022-06-07 18:07:42.032 algo-1:27 INFO hook.py:584] name:fc.0.bias count_params:256
[2022-06-07 18:07:42.032 algo-1:27 INFO hook.py:584] name:fc.2.weight count_params:34048
[2022-06-07 18:07:42.032 algo-1:27 INFO hook.py:584] name:fc.2.bias count_params:133
[2022-06-07 18:07:42.033 algo-1:27 INFO hook.py:586] Total Trainable Params: 558725
Epoch 1 - Starting Training phase.
Epoch: 1 - Training Model on Complete Training Dataset!
[2022-06-07 18:07:43.196 algo-1:27 INFO hook.py:413] Monitoring the collections: gradients, losses, relu_input, CrossEntropyLoss_output_0
[2022-06-07 18:07:43.198 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/prestepzero-*-start-1654625256300626.5_train-0-stepstart-1654625263197798.2/python_stats.
[2022-06-07 18:07:43.213 algo-1:27 INFO hook.py:476] Hook is writing from the hook with pid: 27
[2022-06-07 18:07:57.018 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-0-stepstart-1654625263210557.2_train-0-forwardpassend-1654625277018618.8/python_stats.
[2022-06-07 18:07:58.124 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-0-forwardpassend-1654625277022003.5_train-1-stepstart-1654625278123822.5/python_stats.
[2022-06-07 18:08:01.287 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-1-stepstart-1654625278129748.8_train-1-forwardpassend-1654625281287224.0/python_stats.
[2022-06-07 18:08:03.205 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-1-forwardpassend-1654625281289838.0_train-2-stepstart-1654625283204177.5/python_stats.
[2022-06-07 18:08:06.084 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-2-stepstart-1654625283209026.2_train-2-forwardpassend-1654625286084397.8/python_stats.
[2022-06-07 18:08:07.841 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-2-forwardpassend-1654625286086627.2_train-3-stepstart-1654625287840366.5/python_stats.
[2022-06-07 18:08:10.843 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-3-stepstart-1654625287844353.2_train-3-forwardpassend-1654625290843069.8/python_stats.
[2022-06-07 18:08:12.134 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-3-forwardpassend-1654625290845300.8_train-4-stepstart-1654625292134514.5/python_stats.
[2022-06-07 18:08:15.163 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-4-stepstart-1654625292138340.0_train-4-forwardpassend-1654625295163193.8/python_stats.
[2022-06-07 18:08:16.855 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-4-forwardpassend-1654625295165300.2_train-5-stepstart-1654625296855040.8/python_stats.
[2022-06-07 18:08:19.935 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-5-stepstart-1654625296859196.5_train-5-forwardpassend-1654625299935231.2/python_stats.
[2022-06-07 18:08:21.170 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-5-forwardpassend-1654625299936998.8_train-6-stepstart-1654625301169721.2/python_stats.
[2022-06-07 18:08:24.307 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-6-stepstart-1654625301173561.2_train-6-forwardpassend-1654625304307423.0/python_stats.
[2022-06-07 18:08:25.302 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-6-forwardpassend-1654625304309247.2_train-7-stepstart-1654625305302043.8/python_stats.
[2022-06-07 18:08:28.446 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-7-stepstart-1654625305305777.5_train-7-forwardpassend-1654625308446514.2/python_stats.
[2022-06-07 18:08:29.616 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-7-forwardpassend-1654625308448381.5_train-8-stepstart-1654625309615744.8/python_stats.
[2022-06-07 18:08:32.836 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-8-stepstart-1654625309624521.2_train-8-forwardpassend-1654625312836115.0/python_stats.
[2022-06-07 18:08:33.679 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-8-forwardpassend-1654625312838082.8_train-9-stepstart-1654625313678647.8/python_stats.
[2022-06-07 18:08:36.900 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-9-stepstart-1654625313682626.0_train-9-forwardpassend-1654625316900205.2/python_stats.
[2022-06-07 18:08:38.116 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-9-forwardpassend-1654625316902344.5_train-10-stepstart-1654625318115573.2/python_stats.
Train set: Average loss: 4.5701, Accuracy: 1076/6680 (16%)
Epoch 1 - Starting Testing phase.
Epoch: 1 - Testing Model on Complete Testing Dataset!
Test set: Average loss: 4.1360, Accuracy: 267/836 (32%)
Epoch 2 - Starting Training phase.
Epoch: 2 - Training Model on Complete Training Dataset!
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: Error
PoorWeightInitialization: InProgress
Train set: Average loss: 3.8852, Accuracy: 2238/6680 (34%)
Epoch 2 - Starting Testing phase.
Epoch: 2 - Testing Model on Complete Testing Dataset!
VanishingGradient: Error
Overfit: InProgress
Overtraining: Error
PoorWeightInitialization: InProgress
Test set: Average loss: 3.5910, Accuracy: 315/836 (38%)
Starting to Save the Model
Completed Saving the Model
INFO:__main__:Running on Device cuda:0
INFO:__main__:Hyperparameters : LR: 0.00019923165693702715,  Eps: 6.907014853153287e-09, Weight-decay: 0.01617333419372169, Batch Size: 64, Epoch: 2
INFO:__main__:Data Dir Path: /opt/ml/input/data/train
INFO:__main__:Model Dir  Path: /opt/ml/model
INFO:__main__:Output Dir  Path: /opt/ml/output/data
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 69.6MB/s]
/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py:125: UserWarning: 
Tesla T4 with CUDA capability sm_75 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_35 sm_52 sm_60 sm_61 sm_70 compute_70.
If you want to use the Tesla T4 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
INFO:__main__:Epoch 1 - Starting Training phase.
INFO:__main__:Epoch: 1 - Training Model on Complete Training Dataset!
INFO:__main__:
Train set: Average loss: 4.5701, Accuracy: 1076/6680 (16%)
INFO:__main__:Epoch 1 - Starting Testing phase.
INFO:__main__:Epoch: 1 - Testing Model on Complete Testing Dataset!
INFO:__main__:
Test set: Average loss: 4.1360, Accuracy: 267/836 (32%)
INFO:__main__:Epoch 2 - Starting Training phase.
INFO:__main__:Epoch: 2 - Training Model on Complete Training Dataset!
INFO:__main__:
Train set: Average loss: 3.8852, Accuracy: 2238/6680 (34%)
INFO:__main__:Epoch 2 - Starting Testing phase.
INFO:__main__:Epoch: 2 - Testing Model on Complete Testing Dataset!
INFO:__main__:
Test set: Average loss: 3.5910, Accuracy: 315/836 (38%)
INFO:__main__:Starting to Save the Model
INFO:__main__:Completed Saving the Model
2022-06-07 18:12:55,880 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2022-06-07 18:13:35 Uploading - Uploading generated training model
2022-06-07 18:14:07 Completed - Training job completed
ProfilerReport: NoIssuesFound
Training seconds: 529
Billable seconds: 529
In [21]:
# Fetch the job name, client and job description, to be used for plotting
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=estimator.latest_training_job.name)
In [22]:
print(f"Jobname: {job_name}")
print(f"Client: {client}")
print(f"Description: {description}")
Jobname: pytorch-training-2022-06-07-18-02-54-713
Client: <botocore.client.SageMaker object at 0x7fe2f7557b50>
Description: {'TrainingJobName': 'pytorch-training-2022-06-07-18-02-54-713', 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:881607171913:training-job/pytorch-training-2022-06-07-18-02-54-713', 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/output/model.tar.gz'}, 'TrainingJobStatus': 'Completed', 'SecondaryStatus': 'Completed', 'HyperParameters': {'batch_size': '64', 'eps': '"6.907014853153287e-09"', 'lr': '"0.00019923165693702715"', 'sagemaker_container_log_level': '20', 'sagemaker_job_name': '"pytorch-training-2022-06-07-18-02-54-713"', 'sagemaker_program': '"train_model.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/source/sourcedir.tar.gz"', 'weight_decay': '"0.01617333419372169"'}, 'AlgorithmSpecification': {'TrainingImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.6-gpu-py36', 'TrainingInputMode': 'File', 'EnableSageMakerMetricsTimeSeries': True}, 'RoleArn': 'arn:aws:iam::881607171913:role/service-role/AmazonSageMaker-ExecutionRole-20220606T220935', 'InputDataConfig': [{'ChannelName': 'train', 'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-east-1-881607171913/dogImagesDataset', 'S3DataDistributionType': 'FullyReplicated'}}, 'CompressionType': 'None', 'RecordWrapperType': 'None'}], 'OutputDataConfig': {'KmsKeyId': '', 'S3OutputPath': 's3://sagemaker-us-east-1-881607171913/'}, 'ResourceConfig': {'InstanceType': 'ml.g4dn.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 30}, 'StoppingCondition': {'MaxRuntimeInSeconds': 86400}, 'CreationTime': datetime.datetime(2022, 6, 7, 18, 2, 55, 404000, tzinfo=tzlocal()), 'TrainingStartTime': datetime.datetime(2022, 6, 7, 18, 5, 4, 7000, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2022, 6, 7, 18, 13, 53, 482000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 6, 7, 
18, 15, 15, 915000, tzinfo=tzlocal()), 'SecondaryStatusTransitions': [{'Status': 'Starting', 'StartTime': datetime.datetime(2022, 6, 7, 18, 2, 55, 404000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 6, 7, 18, 5, 4, 7000, tzinfo=tzlocal()), 'StatusMessage': 'Preparing the instances for training'}, {'Status': 'Downloading', 'StartTime': datetime.datetime(2022, 6, 7, 18, 5, 4, 7000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 6, 7, 18, 6, 55, 202000, tzinfo=tzlocal()), 'StatusMessage': 'Downloading input data'}, {'Status': 'Training', 'StartTime': datetime.datetime(2022, 6, 7, 18, 6, 55, 202000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 6, 7, 18, 13, 33, 41000, tzinfo=tzlocal()), 'StatusMessage': 'Training image download completed. Training in progress.'}, {'Status': 'Uploading', 'StartTime': datetime.datetime(2022, 6, 7, 18, 13, 33, 41000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 6, 7, 18, 13, 53, 482000, tzinfo=tzlocal()), 'StatusMessage': 'Uploading generated training model'}, {'Status': 'Completed', 'StartTime': datetime.datetime(2022, 6, 7, 18, 13, 53, 482000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2022, 6, 7, 18, 13, 53, 482000, tzinfo=tzlocal()), 'StatusMessage': 'Training job completed'}], 'EnableNetworkIsolation': False, 'EnableInterContainerTrafficEncryption': False, 'EnableManagedSpotTraining': False, 'TrainingTimeInSeconds': 529, 'BillableTimeInSeconds': 529, 'DebugHookConfig': {'S3OutputPath': 's3://sagemaker-us-east-1-881607171913/', 'CollectionConfigurations': [{'CollectionName': 'CrossEntropyLoss_output_0', 'CollectionParameters': {'eval.save_interval': '1', 'include_regex': 'CrossEntropyLoss_output_0', 'train.save_interval': '10'}}, {'CollectionName': 'relu_input', 'CollectionParameters': {'include_regex': '.*relu_input', 'save_interval': '500'}}, {'CollectionName': 'gradients', 'CollectionParameters': {'save_interval': '500'}}]}, 'DebugRuleConfigurations': [{'RuleConfigurationName': 
'VanishingGradient', 'RuleEvaluatorImage': '503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'VanishingGradient'}}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluatorImage': '503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'Overfit'}}, {'RuleConfigurationName': 'Overtraining', 'RuleEvaluatorImage': '503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'Overtraining'}}, {'RuleConfigurationName': 'PoorWeightInitialization', 'RuleEvaluatorImage': '503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'PoorWeightInitialization'}}], 'DebugRuleEvaluationStatuses': [{'RuleConfigurationName': 'VanishingGradient', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:881607171913:processing-job/pytorch-training-2022-06-0-vanishinggradient-1bb27834', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. Please try again.', 'LastModifiedTime': datetime.datetime(2022, 6, 7, 18, 15, 15, 908000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:881607171913:processing-job/pytorch-training-2022-06-0-overfit-01c6a781', 'RuleEvaluationStatus': 'NoIssuesFound', 'LastModifiedTime': datetime.datetime(2022, 6, 7, 18, 15, 15, 908000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overtraining', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:881607171913:processing-job/pytorch-training-2022-06-0-overtraining-c05c9c56', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. 
Please try again.', 'LastModifiedTime': datetime.datetime(2022, 6, 7, 18, 15, 15, 908000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'PoorWeightInitialization', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:881607171913:processing-job/pytorch-training-2022-06-0-poorweightinitialization-cfa54bd9', 'RuleEvaluationStatus': 'IssuesFound', 'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule PoorWeightInitialization at step 0 resulted in the condition being met\n', 'LastModifiedTime': datetime.datetime(2022, 6, 7, 18, 15, 15, 908000, tzinfo=tzlocal())}], 'ProfilerConfig': {'S3OutputPath': 's3://sagemaker-us-east-1-881607171913/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}}, 'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluatorImage': '503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}], 'ProfilerRuleEvaluationStatuses': [{'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:881607171913:processing-job/pytorch-training-2022-06-0-profilerreport-ad4df406', 'RuleEvaluationStatus': 'NoIssuesFound', 'LastModifiedTime': datetime.datetime(2022, 6, 7, 18, 14, 7, 738000, tzinfo=tzlocal())}], 'ProfilingStatus': 'Enabled', 'ResponseMetadata': {'RequestId': '4cb9d052-c6ce-42bd-a60d-9b011a0bd5e5', 
'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '4cb9d052-c6ce-42bd-a60d-9b011a0bd5e5', 'content-type': 'application/x-amz-json-1.1', 'content-length': '7012', 'date': 'Tue, 07 Jun 2022 18:17:08 GMT'}, 'RetryAttempts': 0}}
In [23]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
#creating a trial
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
[2022-06-07 18:17:15.831 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:34 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-06-07 18:17:15.861 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:34 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/debug-output
In [24]:
trial.tensor_names() #all the tensor names
[2022-06-07 18:17:19.955 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:34 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2022-06-07 18:17:21.005 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:34 INFO trial.py:210] Loaded all steps
Out[24]:
['CrossEntropyLoss_output_0',
 'gradient/ResNet_fc.0.bias',
 'gradient/ResNet_fc.0.weight',
 'gradient/ResNet_fc.2.bias',
 'gradient/ResNet_fc.2.weight',
 'layer1.0.relu_input_0',
 'layer1.0.relu_input_1',
 'layer1.0.relu_input_2',
 'layer1.1.relu_input_0',
 'layer1.1.relu_input_1',
 'layer1.1.relu_input_2',
 'layer1.2.relu_input_0',
 'layer1.2.relu_input_1',
 'layer1.2.relu_input_2',
 'layer2.0.relu_input_0',
 'layer2.0.relu_input_1',
 'layer2.0.relu_input_2',
 'layer2.1.relu_input_0',
 'layer2.1.relu_input_1',
 'layer2.1.relu_input_2',
 'layer2.2.relu_input_0',
 'layer2.2.relu_input_1',
 'layer2.2.relu_input_2',
 'layer2.3.relu_input_0',
 'layer2.3.relu_input_1',
 'layer2.3.relu_input_2',
 'layer3.0.relu_input_0',
 'layer3.0.relu_input_1',
 'layer3.0.relu_input_2',
 'layer3.1.relu_input_0',
 'layer3.1.relu_input_1',
 'layer3.1.relu_input_2',
 'layer3.2.relu_input_0',
 'layer3.2.relu_input_1',
 'layer3.2.relu_input_2',
 'layer3.3.relu_input_0',
 'layer3.3.relu_input_1',
 'layer3.3.relu_input_2',
 'layer3.4.relu_input_0',
 'layer3.4.relu_input_1',
 'layer3.4.relu_input_2',
 'layer3.5.relu_input_0',
 'layer3.5.relu_input_1',
 'layer3.5.relu_input_2',
 'layer4.0.relu_input_0',
 'layer4.0.relu_input_1',
 'layer4.0.relu_input_2',
 'layer4.1.relu_input_0',
 'layer4.1.relu_input_1',
 'layer4.1.relu_input_2',
 'layer4.2.relu_input_0',
 'layer4.2.relu_input_1',
 'layer4.2.relu_input_2',
 'relu_input_0']
In [25]:
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN))
Out[25]:
21
In [26]:
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL))
Out[26]:
28
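The step counts above (21 TRAIN steps, 28 EVAL steps) give us per-mode loss series we can compare to sanity-check for overfitting before plotting. Below is a minimal stdlib-only sketch of that comparison; the `train_losses`/`eval_losses` values are illustrative placeholders, standing in for what `trial.tensor("CrossEntropyLoss_output_0").value(step, mode=...)` would return.

```python
# Hypothetical loss values, standing in for values pulled out of the smdebug trial.
train_losses = [4.9, 4.1, 3.3, 2.8, 2.4, 2.1]
eval_losses = [4.7, 4.0, 3.5, 3.2, 3.1, 3.0]

def mean(values):
    return sum(values) / len(values)

train_mean = mean(train_losses)
eval_mean = mean(eval_losses)
# A widening gap between eval and train loss over training suggests overfitting.
gap = eval_mean - train_mean
print(f"mean train loss: {train_mean:.3f}, mean eval loss: {eval_mean:.3f}, gap: {gap:.3f}")
```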
In [27]:
#Defining some utility functions to be used for plotting tensors
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

#utility function to get data from tensors
def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

#utility function to plot a tensor's TRAIN and EVAL values on shared axes
def plot_tensor(trial, tensor_name):

    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("Completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("Completed EVAL plot")
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)
    plt.show()
In [28]:
#plotting the tensor
plot_tensor(trial, "CrossEntropyLoss_output_0")
loaded TRAIN data
loaded EVAL data
Completed TRAIN plot
Completed EVAL plot
In [29]:
# TODO: Display the profiler output
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"Profiler report location: {rule_output_path}")
Profiler report location: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output
In [30]:
! aws s3 ls {rule_output_path} --recursive
2022-06-07 18:13:58     380565 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-report.html
2022-06-07 18:13:57     229714 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2022-06-07 18:13:52        191 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2022-06-07 18:13:52      15554 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2022-06-07 18:13:52        126 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2022-06-07 18:13:52        129 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
2022-06-07 18:13:52       2221 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json
2022-06-07 18:13:52        307 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json
2022-06-07 18:13:52        153 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json
2022-06-07 18:13:52        230 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json
2022-06-07 18:13:52       1142 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json
2022-06-07 18:13:52        619 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json
2022-06-07 18:13:52       2209 pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/StepOutlier.json
In [31]:
! aws s3 cp {rule_output_path} ./ --recursive
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to ProfilerReport/profiler-output/profiler-reports/BatchSize.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json to ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json to ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json to ProfilerReport/profiler-output/profiler-reports/Dataloader.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json to ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json to ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json to ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to ProfilerReport/profiler-output/profiler-report.ipynb
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json to ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/StepOutlier.json to ProfilerReport/profiler-output/profiler-reports/StepOutlier.json
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-report.html to ProfilerReport/profiler-output/profiler-report.html
download: s3://sagemaker-us-east-1-881607171913/pytorch-training-2022-06-07-18-02-54-713/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json to ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json
In [32]:
import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
In [33]:
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")
Out[33]:
profiler-report

SageMaker Debugger Profiling Report

SageMaker Debugger auto generated this report. You can generate similar reports on all supported training jobs. The report provides summary of training job, system resource usage statistics, framework metrics, rules summary, and detailed analysis from each rule. The graphs and tables are interactive.

Legal disclaimer: This report and any recommendations are provided for informational purposes only and are not definitive. You are responsible for making your own independent assessment of the information.

In [4]:
# Parameters
processing_job_arn = "arn:aws:sagemaker:us-east-1:881607171913:processing-job/pytorch-training-2022-06-0-profilerreport-ad4df406"

Training job summary

System usage statistics

Framework metrics summary

Overview: CPU operators

Overview: GPU operators

Rules summary

The following table shows a profiling summary of the Debugger built-in rules. The table is sorted by the rules that triggered the most frequently. During your training job, the CPUBottleneck rule was the most frequently triggered. It processed 925 datapoints and was triggered 0 times.

Description Recommendation Number of times rule triggered Number of datapoints Rule parameters
CPUBottleneck Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. Consider increasing the number of data loaders or applying data pre-fetching. 0 925 threshold:50
cpu_threshold:90
gpu_threshold:10
patience:1000
StepOutlier Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. 0 239 threshold:3
mode:None
n_outliers:10
stddev:3
BatchSize Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. 0 898 cpu_threshold_p95:70
gpu_threshold_p95:70
gpu_memory_threshold_p95:70
patience:1000
window:500
Dataloader Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. Change the number of data loader processes. 0 0 min_threshold:70
max_threshold:200
IOBottleneck Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. 0 925 threshold:50
io_threshold:50
gpu_threshold:10
patience:1000
MaxInitializationTime Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. 0 239 threshold:20
GPUMemoryIncrease Measures the average GPU memory footprint and triggers if there is a large increase. Choose a larger instance type with more memory if footprint is close to maximum available memory. 0 899 increase:5
patience:1000
window:10
LoadBalancing Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. Choose a different distributed training strategy or a different distributed training framework. 0 899 threshold:0.2
patience:1000
LowGPUUtilization Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. 0 899 threshold_p95:70
threshold_p5:10
window:500
patience:1000

Analyzing the training loop

Step duration analysis

Step durations on node algo-1-27:

The following table is a summary of the statistics of step durations measured on node algo-1-27. The rule has analyzed the step duration from Step:ModeKeys.TRAIN phase. The average step duration on node algo-1-27 was 0.24s. The rule detected 1 outliers, where step duration was larger than 3 times the standard deviation of 1.15s

mean max p99 p95 p50 min
Step Durations in [s] 0.24 13.83 3.22 2.23 0.02 0.02

The following histogram shows the step durations measured on the different nodes. You can turn on or turn off the visualization of histograms by selecting or unselecting the labels in the legend.

GPU utilization analysis

Usage per GPU

Workload balancing

Dataloading analysis

Batch size

CPU bottlenecks

I/O bottlenecks

GPU memory

In [36]:
# Zipping the ProfilerReport in order to export it and upload it later for submission
import shutil
shutil.make_archive("./profiler_report", "zip", "ProfilerReport")
Out[36]:
'/root/CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter/profiler_report.zip'

Model Deployment

In [37]:
# TODO: Deploy your model to an endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
------!
In [39]:
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor

#Below is the S3 location of the saved model produced by the training job that used the best hyperparameters
model_data_artifacts = "s3://sagemaker-us-east-1-881607171913/pytorch-training-220607-1708-003-de60ee22/output/model.tar.gz"

#Define the default serializer and deserializer to be used for predictions
jpeg_serializer = sagemaker.serializers.IdentitySerializer("image/jpeg")
json_deserializer = sagemaker.deserializers.JSONDeserializer()

#To override the default serializer and deserializer, we define a class that inherits from Predictor and pass it as the predictor_cls parameter to our PyTorchModel
class ImgPredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImgPredictor, self).__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=jpeg_serializer,
            deserializer=json_deserializer
        )
        
pytorch_model = PyTorchModel(model_data=model_data_artifacts,
                             role=role,
                             entry_point="endpoint_inference.py",
                             py_version="py36",
                             framework_version="1.6",
                             predictor_cls=ImgPredictor)

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.t2.medium") #Using ml.t2.medium to save costs
---------!
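The `entry_point` above refers to our custom inference script. Below is a structural sketch of the four handlers such a script overrides for the PyTorch serving container; it is not the actual `endpoint_inference.py`. The real script would load the fine-tuned ResNet50 in `model_fn` and run image transforms plus a forward pass in `predict_fn`; here those parts are replaced with stdlib placeholders so only the control flow is shown.

```python
# Structural sketch of the handlers a custom inference script overrides.
import json

JPEG_CONTENT_TYPE = "image/jpeg"

def model_fn(model_dir):
    # Real version: load the fine-tuned ResNet50 weights from model_dir.
    return "dummy-model"

def input_fn(request_body, content_type=JPEG_CONTENT_TYPE):
    # Real version: decode the JPEG bytes into a PIL image.
    if content_type == JPEG_CONTENT_TYPE:
        return request_body
    raise ValueError(f"Unsupported content type: {content_type}")

def predict_fn(input_object, model):
    # Real version: apply transforms and run a forward pass with gradients disabled.
    return [[0.0, 1.83, 0.24]]

def output_fn(prediction, accept="application/json"):
    # Serialize the prediction so the JSONDeserializer on the client can parse it.
    return json.dumps(prediction)

# Simulated request/response cycle, mirroring what the serving container does:
model = model_fn("/opt/ml/model")
payload = input_fn(b"\xff\xd8\xff", JPEG_CONTENT_TYPE)
print(output_fn(predict_fn(payload, model)))
```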
In [58]:
#Testing the deployed endpoint using some test images
#Solution 1: Using the Predictor object directly.
from PIL import Image
import io
import os
import numpy as np

test_dir = "./dogImages/test/129.Tibetan_mastiff/"
test_images = ["Tibetan_mastiff_08158.jpg", "Tibetan_mastiff_08139.jpg", "Tibetan_mastiff_08138.jpg"]
test_images_expected_output = [129, 5, 21 ]
for index in range(len(test_images) ):
    test_img = test_images[index]
    expected_breed_category = test_images_expected_output[index]
    print(f"Test image no: {index+1}")
    test_file_path = os.path.join(test_dir,test_img)
    with open(test_file_path , "rb") as f:
        payload = f.read()
        print("Below is the image that we will be testing:")
        display(Image.open(io.BytesIO(payload)))
        print(f"Expected dog breed category no : {expected_breed_category}")
        response = predictor.predict(payload, initial_args={"ContentType": "image/jpeg"})
        print(f"Response: {response}")
        predicted_dog_breed = np.argmax(response,1) + 1 #Add 1 because the model's output is zero-indexed while breed categories start at 1
        print(f"Response/Inference for the above image is : {predicted_dog_breed}")
        print("----------------------------------------------------------------------")
Test image no: 1
Below is the image that we will be testing:
Expected dog breed category no : 129
Response: [[0.0, 0.0, 0.0, 1.8310102224349976, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011959796771407127, 0.0, 0.2432364821434021, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.021940354257822037, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8271625638008118, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3083130419254303, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.19880977272987366, 1.1451776027679443, 0.0, 0.0, 0.0, 0.8414880037307739, 0.0, 0.02018183469772339, 0.0, 0.2915998101234436, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.32587119936943054, 0.0, 0.20356698334217072, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.725248336791992, 0.0, 0.0, 0.0, 0.0, 1.140865445137024, 0.0, 0.0, 2.3388617038726807, 0.0, 0.0, 0.2683431804180145, 0.0, 0.19908222556114197, 1.6536762714385986, 0.0, 0.475004106760025, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.8256568908691406, 0.12499096989631653, 0.07368388772010803, 0.0, 0.0]]
Response/Inference for the above image is : [129]
----------------------------------------------------------------------
Test image no: 2
Below is the image that we will be testing:
Expected dog breed category no : 5
Response: [[0.0, 0.0, 0.0, 0.7785679697990417, 0.0, 0.0, 0.6032100915908813, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21504449844360352, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6418353319168091, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7718003392219543, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2406315952539444, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9794938564300537, 0.24579012393951416, 0.8128808736801147, 0.0, 0.0, 0.0, 0.6243339776992798, 0.0, 0.2508496344089508, 0.0, 0.0, 0.0, 0.05432766303420067, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.10714370012283325, 0.0, 1.101265549659729, 0.39284422993659973, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.012158554047346115, 0.0, 0.0, 0.0, 2.433807849884033, 0.0, 0.0, 0.0, 0.0, 0.8312989473342896, 0.0, 0.0, 1.8716989755630493, 0.0, 0.0, 0.0, 0.0, 0.0, 2.610265016555786, 0.0, 0.9424132108688354, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.7487382888793945, 0.36241140961647034, 0.27414369583129883, 0.0, 0.0]]
Response/Inference for the above image is : [129]
----------------------------------------------------------------------
Test image no: 3
Below is the image that we will be testing:
Expected dog breed category no : 21
Response: [[1.0307153463363647, 0.0, 0.0, 1.181191325187683, 0.0, 0.0, 0.0, 0.0, 0.07557682693004608, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0733839273452759, 1.0816701650619507, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.3238658905029297, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5609280467033386, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.09209688007831573, 0.4193204343318939, 0.4013954699039459, 0.0, 0.0, 0.0, 0.4956963360309601, 0.0, 0.5312991142272949, 0.0, 0.37659335136413574, 0.0, 0.010736760683357716, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3502495586872101, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6166532039642334, 0.0, 0.0, 0.0, 1.9142392873764038, 0.0, 0.0, 0.0, 0.0, 0.31606820225715637, 0.47971397638320923, 0.0, 2.056882858276367, 0.0, 0.0, 0.0, 0.0, 0.0, 0.43582630157470703, 0.0, 0.3661939799785614, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.030293241143226624, 0.0, 0.0, 0.0, 0.0, 3.551938533782959, 0.0, 0.41795605421066284, 0.0, 0.0]]
Response/Inference for the above image is : [129]
----------------------------------------------------------------------
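The responses above are raw, un-normalized scores from the final fully connected layer. To report a confidence alongside the argmax, the scores can be passed through a softmax. Below is a stdlib-only sketch; the three-element score vector is illustrative (a real response holds 133 scores).

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative raw scores standing in for one row of the endpoint response.
raw_scores = [0.0, 2.83, 0.12]
probs = softmax(raw_scores)
best = max(range(len(probs)), key=probs.__getitem__)
# Breed categories are 1-indexed, so add 1 to the argmax.
print(f"predicted category: {best + 1}, confidence: {probs[best]:.3f}")
```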
In [59]:
print(predictor.endpoint_name)
endpoint_name = predictor.endpoint_name
pytorch-inference-2022-06-07-18-28-51-111
In [60]:
# Solution 2: Using boto3
# Using the boto3 runtime client to test the deployed model's endpoint
import os
import io
import boto3
import json
import base64
import PIL
# setting the endpoint name
ENDPOINT_NAME = endpoint_name
# We will be using boto3's SageMaker runtime client to invoke the endpoint.
runtime= boto3.client('runtime.sagemaker')
test_dir = "./dogImages/test/129.Tibetan_mastiff/"
test_images = ["Tibetan_mastiff_08158.jpg", "Tibetan_mastiff_08139.jpg", "Tibetan_mastiff_08138.jpg"]
test_images_expected_output = [129, 5, 21 ]
for index in range(len(test_images) ):
    test_img = test_images[index]
    expected_breed_category = test_images_expected_output[index]
    print(f"Test image no: {index+1}")
    test_file_path = os.path.join(test_dir,test_img)
    with open(test_file_path , "rb") as f:
        payload = f.read()
        print("Below is the image that we will be testing:")
        display(Image.open(io.BytesIO(payload)))
        print(f"Expected dog breed category no : {expected_breed_category}")
        response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='image/jpeg',
                                       Body=payload)
        response_body = np.asarray(json.loads( response['Body'].read().decode('utf-8')))        
        print(f"Response: {response_body}")        
        predicted_dog_breed = np.argmax(response_body,1) + 1 #Add 1 because the model's output is zero-indexed while breed categories start at 1
        print(f"Response/Inference for the above image is : {predicted_dog_breed}")
Test image no: 1
Below is the image that we will be testing:
Expected dog breed category no : 129
Response: [[0.         0.         0.         1.83101022 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.0119598  0.         0.24323648 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.02194035 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.82716256 0.
  0.         0.         0.         0.         0.30831304 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.19880977
  1.1451776  0.         0.         0.         0.841488   0.
  0.02018183 0.         0.29159981 0.         0.         0.
  0.         0.         0.         0.         0.         0.3258712
  0.         0.20356698 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         2.72524834 0.         0.         0.         0.
  1.14086545 0.         0.         2.3388617  0.         0.
  0.26834318 0.         0.19908223 1.65367627 0.         0.47500411
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         2.82565689 0.12499097 0.07368389 0.
  0.        ]]
Response/Inference for the above image is : [129]
Test image no: 2
Below is the image that we will be testing:
Expected dog breed category no : 5
Response: [[0.         0.         0.         0.77856797 0.         0.
  0.60321009 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.2150445  0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.64183533 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.77180034 0.
  0.         0.         0.         0.         0.2406316  0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.97949386 0.24579012
  0.81288087 0.         0.         0.         0.62433398 0.
  0.25084963 0.         0.         0.         0.05432766 0.
  0.         0.         0.         0.         0.         0.1071437
  0.         1.10126555 0.39284423 0.         0.         0.
  0.         0.         0.         0.01215855 0.         0.
  0.         2.43380785 0.         0.         0.         0.
  0.83129895 0.         0.         1.87169898 0.         0.
  0.         0.         0.         2.61026502 0.         0.94241321
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         2.74873829 0.36241141 0.2741437  0.
  0.        ]]
Response/Inference for the above image is : [129]
Test image no: 3
Below is the image that we will be testing:
Expected dog breed category no : 21
Response: [[1.03071535 0.         0.         1.18119133 0.         0.
  0.         0.         0.07557683 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         1.07338393 1.08167017 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         1.32386589 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.56092805 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.09209688 0.41932043
  0.40139547 0.         0.         0.         0.49569634 0.
  0.53129911 0.         0.37659335 0.         0.01073676 0.
  0.         0.         0.         0.         0.         0.35024956
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.6166532  0.         0.
  0.         1.91423929 0.         0.         0.         0.
  0.3160682  0.47971398 0.         2.05688286 0.         0.
  0.         0.         0.         0.4358263  0.         0.36619398
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.03029324 0.         0.
  0.         0.         3.55193853 0.         0.41795605 0.
  0.        ]]
Response/Inference for the above image is : [129]

From the three test images above, we can see that the model was able to correctly predict the breed of the first image.

In [61]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()
In [ ]: